R Markdown

install.packages('fivethirtyeightdata', repos = 'https://fivethirtyeightdata.github.io/drat/', type = 'source')
## Installing package into '/stor/home/sm69929/R/x86_64-pc-linux-gnu-library/3.6'
## (as 'lib' is unspecified)
## Warning in install.packages("fivethirtyeightdata", repos = "https://
## fivethirtyeightdata.github.io/drat/", : installation of package
## 'fivethirtyeightdata' had non-zero exit status
library(fivethirtyeight)
## Some larger datasets need to be installed separately, like senators and
## house_district_forecast. To install these, we recommend you install the
## fivethirtyeightdata package by running:
## install.packages('fivethirtyeightdata', repos =
## 'https://fivethirtyeightdata.github.io/drat/', type = 'source')
bad_drivers <- bad_drivers
partisan_lean <- partisan_lean_state
library(tidyverse)
## ── Attaching packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2     ✓ purrr   0.3.4
## ✓ tibble  3.0.3     ✓ dplyr   1.0.1
## ✓ tidyr   1.1.1     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0
## ── Conflicts ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(cluster)

Introduction

I have chosen the datasets bad_drivers and partisan_lean. The dataset bad_drivers gives a plethora of information on the worst drivers from each U.S. state and D.C. It has 51 rows, one for each of the 50 states plus the District of Columbia, and 7 additional numeric variables about drivers' accidents: num_drivers, perc_speeding, perc_alcohol, perc_not_distracted, perc_no_previous, insurance_premiums, and losses. These all describe drivers involved in fatal collisions, and losses is based on insurance losses. The dataset partisan_lean has 50 rows and 3 columns: state (the 50 states), pvi_party (the party of the state's lean, Democratic or Republican), and pvi_amount (the Cook Partisan Voting Index). I found these datasets while browsing the CRAN package fivethirtyeight, one of the packages given to us in the project instructions. I found bad_drivers interesting because I wonder which state really has the worst drivers! Driving in Austin makes Texas feel like a top choice, but I will see! I chose partisan_lean because I wanted to see whether the direction a state leans politically is associated with how bad its drivers are. I am keeping an open mind about confounding variables like population, education, and more.

Joining/Merging

partydrivers <- partisan_lean %>% left_join(bad_drivers)
## Joining, by = "state"

My datasets were both tidy: every value belongs to a variable and an observation, so I was ready to join them immediately. I did a left join on partisan_lean because I only wanted the rows of bad_drivers whose states also appear in partisan_lean. bad_drivers has a row for the District of Columbia, while partisan_lean does not; without a political lean recorded, I would not have been able to analyze any association between political affiliation and bad driving for D.C., so the join drops that row.
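To confirm which rows the join drops, an anti-join can be run first (a quick check, assuming the datasets are loaded as above):

```r
library(dplyr)

# Rows of bad_drivers with no matching state in partisan_lean -- these are
# exactly the rows that a left join on partisan_lean discards.
bad_drivers %>% anti_join(partisan_lean, by = "state")
# should return only the District of Columbia row
```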

Wrangling

#Filter
partydrivers %>% filter(pvi_party=="D", insurance_premiums<935)
## # A tibble: 11 x 10
##    state pvi_party pvi_amount num_drivers perc_speeding perc_alcohol
##    <chr> <fct>          <dbl>       <dbl>         <int>        <int>
##  1 Cali… D                 24        12              35           28
##  2 Colo… D                  1        13.6            37           28
##  3 Hawa… D                 36        17.5            54           41
##  4 Illi… D                 13        12.8            36           34
##  5 Maine D                  5        15.1            38           30
##  6 Minn… D                  2         9.6            23           29
##  7 New … D                  7        18.4            19           27
##  8 Oreg… D                  9        12.8            33           26
##  9 Verm… D                 24        13.6            30           30
## 10 Virg… D                  0        12.7            19           27
## 11 Wash… D                 12        10.6            42           33
## # … with 4 more variables: perc_not_distracted <int>, perc_no_previous <int>,
## #   insurance_premiums <dbl>, losses <dbl>
#Select
partydrivers %>% select(state, contains("perc"))
## # A tibble: 50 x 5
##    state       perc_speeding perc_alcohol perc_not_distracted perc_no_previous
##    <chr>               <int>        <int>               <int>            <int>
##  1 Alabama                39           30                  96               80
##  2 Alaska                 41           25                  90               94
##  3 Arizona                35           28                  84               96
##  4 Arkansas               18           26                  94               95
##  5 California             35           28                  91               89
##  6 Colorado               37           28                  79               95
##  7 Connecticut            46           36                  87               82
##  8 Delaware               38           30                  87               99
##  9 Florida                21           29                  92               94
## 10 Georgia                19           25                  95               93
## # … with 40 more rows
#Arrange
partydrivers %>% arrange(desc(losses))
## # A tibble: 50 x 10
##    state pvi_party pvi_amount num_drivers perc_speeding perc_alcohol
##    <chr> <fct>          <dbl>       <dbl>         <int>        <int>
##  1 Loui… R                 17        20.5            35           33
##  2 Mary… D                 23        12.5            34           32
##  3 Okla… R                 34        19.9            32           29
##  4 Conn… D                 11        10.8            46           36
##  5 Cali… D                 24        12              35           28
##  6 New … D                 13        11.2            16           28
##  7 Texas R                 17        19.4            40           38
##  8 Miss… R                 15        17.6            15           31
##  9 Tenn… R                 28        19.5            21           29
## 10 Penn… R                  1        18.2            50           31
## # … with 40 more rows, and 4 more variables: perc_not_distracted <int>,
## #   perc_no_previous <int>, insurance_premiums <dbl>, losses <dbl>
#Mutate, Group_By 
partydrivers %>% group_by(pvi_party) %>% mutate(mean2 = cummean(perc_alcohol)) %>% arrange(desc(mean2))
## # A tibble: 50 x 11
## # Groups:   pvi_party [2]
##    state pvi_party pvi_amount num_drivers perc_speeding perc_alcohol
##    <chr> <fct>          <dbl>       <dbl>         <int>        <int>
##  1 Illi… D                 13        12.8            36           34
##  2 Mass… D                 29         8.2            23           35
##  3 Hawa… D                 36        17.5            54           41
##  4 Maine D                  5        15.1            38           30
##  5 Mary… D                 23        12.5            34           32
##  6 Mich… D                  1        14.1            24           28
##  7 Minn… D                  2         9.6            23           29
##  8 New … D                 13        11.2            16           28
##  9 New … D                  7        18.4            19           27
## 10 Rhod… D                 26        11.1            34           38
## # … with 40 more rows, and 5 more variables: perc_not_distracted <int>,
## #   perc_no_previous <int>, insurance_premiums <dbl>, losses <dbl>, mean2 <dbl>
#Summary Statistics 

partydrivers %>%  summarize(across(where(is.numeric), ~ mean(.x, na.rm = TRUE)))
## # A tibble: 1 x 8
##   pvi_amount num_drivers perc_speeding perc_alcohol perc_not_distra…
##        <dbl>       <dbl>         <dbl>        <dbl>            <dbl>
## 1       16.9        16.0          31.7         30.8             85.6
## # … with 3 more variables: perc_no_previous <dbl>, insurance_premiums <dbl>,
## #   losses <dbl>
partydrivers %>%  summarize(across(where(is.numeric), ~ sd(.x, na.rm = TRUE)))
## # A tibble: 1 x 8
##   pvi_amount num_drivers perc_speeding perc_alcohol perc_not_distra…
##        <dbl>       <dbl>         <dbl>        <dbl>            <dbl>
## 1       11.5        3.91          9.73         5.16             15.2
## # … with 3 more variables: perc_no_previous <dbl>, insurance_premiums <dbl>,
## #   losses <dbl>
partydrivers %>%  summarize(across(where(is.numeric), ~ var(.x, na.rm = TRUE)))
## # A tibble: 1 x 8
##   pvi_amount num_drivers perc_speeding perc_alcohol perc_not_distra…
##        <dbl>       <dbl>         <dbl>        <dbl>            <dbl>
## 1       133.        15.3          94.6         26.6             230.
## # … with 3 more variables: perc_no_previous <dbl>, insurance_premiums <dbl>,
## #   losses <dbl>
partydrivers %>%  summarize(across(where(is.numeric), ~ quantile(.x, na.rm = TRUE)))
## # A tibble: 5 x 8
##   pvi_amount num_drivers perc_speeding perc_alcohol perc_not_distra…
##        <dbl>       <dbl>         <dbl>        <dbl>            <dbl>
## 1          0         8.2            13           16             10  
## 2          7        12.8            23           28             82.5
## 3         17        15.6            34           30             88  
## 4         24        18.6            38           33             94.8
## 5         47        23.9            54           44             99  
## # … with 3 more variables: perc_no_previous <dbl>, insurance_premiums <dbl>,
## #   losses <dbl>
partydrivers %>%  summarize(across(where(is.numeric), ~ min(.x, na.rm = TRUE)))
## # A tibble: 1 x 8
##   pvi_amount num_drivers perc_speeding perc_alcohol perc_not_distra…
##        <dbl>       <dbl>         <int>        <int>            <int>
## 1          0         8.2            13           16               10
## # … with 3 more variables: perc_no_previous <int>, insurance_premiums <dbl>,
## #   losses <dbl>
partydrivers %>%  summarize(across(where(is.numeric), ~ max(.x, na.rm = TRUE)))
## # A tibble: 1 x 8
##   pvi_amount num_drivers perc_speeding perc_alcohol perc_not_distra…
##        <dbl>       <dbl>         <int>        <int>            <int>
## 1         47        23.9            54           44               99
## # … with 3 more variables: perc_no_previous <int>, insurance_premiums <dbl>,
## #   losses <dbl>
partydrivers %>%  summarize(across(where(is.numeric), ~ median(.x, na.rm = TRUE)))
## # A tibble: 1 x 8
##   pvi_amount num_drivers perc_speeding perc_alcohol perc_not_distra…
##        <dbl>       <dbl>         <dbl>        <dbl>            <dbl>
## 1         17        15.6            34           30               88
## # … with 3 more variables: perc_no_previous <dbl>, insurance_premiums <dbl>,
## #   losses <dbl>
partydrivers %>%  summarize(across(where(is.numeric), ~ n_distinct(.x, na.rm = TRUE)))
## # A tibble: 1 x 8
##   pvi_amount num_drivers perc_speeding perc_alcohol perc_not_distra…
##        <int>       <int>         <int>        <int>            <int>
## 1         29          44            29           19               25
## # … with 3 more variables: perc_no_previous <int>, insurance_premiums <int>,
## #   losses <int>
partydriversonlynum <- partydrivers %>% select_if(is.numeric)
partydriversonlynum %>% cor
##                       pvi_amount num_drivers perc_speeding perc_alcohol
## pvi_amount           1.000000000  0.28797287   0.101069621   0.14192790
## num_drivers          0.287972873  1.00000000  -0.018663595   0.17578538
## perc_speeding        0.101069621 -0.01866360   1.000000000   0.29140608
## perc_alcohol         0.141927903  0.17578538   0.291406080   1.00000000
## perc_not_distracted  0.108741412  0.05932482   0.128472265   0.05780096
## perc_no_previous     0.004789969  0.06712173   0.006442366  -0.22911225
## insurance_premiums  -0.067215556 -0.10465864   0.033770006   0.01517102
## losses              -0.037991143 -0.03506761  -0.061579945  -0.08344099
##                     perc_not_distracted perc_no_previous insurance_premiums
## pvi_amount                   0.10874141      0.004789969       -0.067215556
## num_drivers                  0.05932482      0.067121733       -0.104658639
## perc_speeding                0.12847227      0.006442366        0.033770006
## perc_alcohol                 0.05780096     -0.229112249        0.015171025
## perc_not_distracted          1.00000000     -0.234326992       -0.022855291
## perc_no_previous            -0.23432699      1.000000000        0.004128919
## insurance_premiums          -0.02285529      0.004128919        1.000000000
## losses                      -0.06018868      0.041835397        0.652502452
##                          losses
## pvi_amount          -0.03799114
## num_drivers         -0.03506761
## perc_speeding       -0.06157994
## perc_alcohol        -0.08344099
## perc_not_distracted -0.06018868
## perc_no_previous     0.04183540
## insurance_premiums   0.65250245
## losses               1.00000000
partydrivers %>% group_by(pvi_party) %>% summarize(mean = mean(num_drivers), sd = sd(num_drivers))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 3
##   pvi_party  mean    sd
##   <fct>     <dbl> <dbl>
## 1 D          12.9  2.58
## 2 R          17.9  3.36
partydrivers %>% group_by(pvi_party) %>% summarise(median = median(perc_alcohol), n = n())
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 3
##   pvi_party median     n
##   <fct>      <int> <int>
## 1 D             30    19
## 2 R             30    31
#Summary statistics Visualizations & Tidying to rearrange wide/long

partydrivers2 <- partydrivers
names(partydrivers2)<-gsub("\\_","",names(partydrivers2))


partydrivers2 %>%  summarize_if(is.numeric,.funs = list("mean"=mean,"median"=median, "sd"=sd, "max"=max, "min"=min, "var"=var, "ndistinct" = n_distinct), na.rm=T) %>%
pivot_longer(contains("_"))%>%
separate(name,into=c("Variable","Statistics"), sep="_", convert = T)%>%
pivot_wider(names_from = "Variable",values_from="value")%>% arrange(Statistics)
## # A tibble: 7 x 9
##   Statistics pviamount numdrivers percspeeding percalcohol percnotdistract…
##   <chr>          <dbl>      <dbl>        <dbl>       <dbl>            <dbl>
## 1 max             47        23.9         54          44                99  
## 2 mean            16.9      16.0         31.7        30.8              85.6
## 3 median          17        15.6         34          30                88  
## 4 min              0         8.2         13          16                10  
## 5 ndistinct       29        44           29          19                25  
## 6 sd              11.5       3.91         9.73        5.16             15.2
## 7 var            133.       15.3         94.6        26.6             230. 
## # … with 3 more variables: percnoprevious <dbl>, insurancepremiums <dbl>,
## #   losses <dbl>

When making my condensed table of summary statistics, I needed to tidy the data to make it less wide: the raw summarize output repeated the same statistic (such as mean) across many columns and the same variable (such as numdrivers) across many statistics. I first removed the underscores from the original variable names, so that after summarizing, the only underscore in each column name was the one separating the variable from the statistic (e.g., numdrivers_mean), which made it simple to pivot on. I used pivot_longer to lengthen the table so the same data was presented vertically, then used separate to split each column name at the underscore, placing the variable names into a column called Variable and the summary functions into a column called Statistics. Finally, I used pivot_wider to give each original column of partydrivers2 its own column again, now with one row per summary statistic.
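The reshape logic is easier to see on a toy example (toy numbers below, not the real summary statistics):

```r
library(tidyverse)

# Two variables x two statistics, flattened into one wide row,
# mimicking what summarize_if produces.
wide <- tibble(numdrivers_mean = 15, numdrivers_sd = 4,
               losses_mean = 130, losses_sd = 25)

wide %>%
  pivot_longer(contains("_")) %>%                                   # one row per name/value pair
  separate(name, into = c("Variable", "Statistics"), sep = "_") %>% # split name at the underscore
  pivot_wider(names_from = "Variable", values_from = "value")       # one column per variable
# result: one row per statistic (mean, sd), columns numdrivers and losses
```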

Visualizations

Correlation Heatmap

partydrivedf <- partydrivers %>% select_if(is.numeric) %>% cor()
partydrivedf
##                       pvi_amount num_drivers perc_speeding perc_alcohol
## pvi_amount           1.000000000  0.28797287   0.101069621   0.14192790
## num_drivers          0.287972873  1.00000000  -0.018663595   0.17578538
## perc_speeding        0.101069621 -0.01866360   1.000000000   0.29140608
## perc_alcohol         0.141927903  0.17578538   0.291406080   1.00000000
## perc_not_distracted  0.108741412  0.05932482   0.128472265   0.05780096
## perc_no_previous     0.004789969  0.06712173   0.006442366  -0.22911225
## insurance_premiums  -0.067215556 -0.10465864   0.033770006   0.01517102
## losses              -0.037991143 -0.03506761  -0.061579945  -0.08344099
##                     perc_not_distracted perc_no_previous insurance_premiums
## pvi_amount                   0.10874141      0.004789969       -0.067215556
## num_drivers                  0.05932482      0.067121733       -0.104658639
## perc_speeding                0.12847227      0.006442366        0.033770006
## perc_alcohol                 0.05780096     -0.229112249        0.015171025
## perc_not_distracted          1.00000000     -0.234326992       -0.022855291
## perc_no_previous            -0.23432699      1.000000000        0.004128919
## insurance_premiums          -0.02285529      0.004128919        1.000000000
## losses                      -0.06018868      0.041835397        0.652502452
##                          losses
## pvi_amount          -0.03799114
## num_drivers         -0.03506761
## perc_speeding       -0.06157994
## perc_alcohol        -0.08344099
## perc_not_distracted -0.06018868
## perc_no_previous     0.04183540
## insurance_premiums   0.65250245
## losses               1.00000000
partydrivedf2 <- partydrivedf %>% as.data.frame()

tidyparty <- partydrivedf2 %>% rownames_to_column("var1") %>% 
pivot_longer(-1, names_to="var2", values_to="correlation")

tidyparty %>% ggplot(aes(var1, var2, fill=correlation)) + geom_tile() + scale_fill_gradient2(low="purple", mid="white", high="red") + geom_text(aes(label=round(correlation,2)),color = "black", size = 2)+ 
theme(axis.text.x = element_text(angle = 90, hjust = 1))+  coord_fixed()+ ggtitle("Correlation Heatmap")

Plot 1

partydrivers %>% ggplot(aes(losses, insurance_premiums, color=pvi_party)) + xlab("Losses Incurred Per Insured Drivers Collisons ($)") + ylab("Car Insurance Premiums ($)") + ggtitle("Car Insurance Premiums vs  Insurance Company Collision Losses Per Party ") + geom_point()+ theme_bw() +scale_x_continuous(n.breaks=15) + geom_smooth(method = "lm") + scale_color_manual(values = c("#0C0CDE", "#D51717"))
## `geom_smooth()` using formula 'y ~ x'

The graph shows a positive correlation along both of our trendlines. In both Democratic- and Republican-leaning states, higher losses are associated with higher premiums, presumably to make up for the losses. Comparing the trendlines between the parties, citizens in Democratic states appear to pay more in insurance premiums overall than those in Republican states. The outliers in blue states tend to sit above the trendline and the outliers in red states tend to sit below it, so you could potentially be paying higher rates in the blue states. The trendlines do not start at the same point, and the minimum cost tends to be lower in the red states than in the blue states. The confidence intervals at the higher end of losses are based on few points, so the assumption that this positive, linear relationship continues should be taken with a grain of salt.
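The trendline reading could also be checked numerically by fitting the same linear model per party (a sketch using the partydrivers data built above):

```r
library(dplyr)

# Per-party correlation and fitted slope for premiums vs. losses
partydrivers %>%
  group_by(pvi_party) %>%
  summarize(
    r     = cor(losses, insurance_premiums),          # strength of the relationship
    slope = coef(lm(insurance_premiums ~ losses))[2]  # premium dollars per dollar of losses
  )
```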

Plot 2

partydriversinsur <- partydrivers %>%
  mutate(insurance_rate = case_when(
    insurance_premiums > 1074 ~ "high",
    insurance_premiums <= 1074 & insurance_premiums >= 744 ~ "med",
    insurance_premiums < 744 ~ "low"
  ))
partydriversinsur
## # A tibble: 50 x 11
##    state pvi_party pvi_amount num_drivers perc_speeding perc_alcohol
##    <chr> <fct>          <dbl>       <dbl>         <int>        <int>
##  1 Alab… R                 27        18.8            39           30
##  2 Alas… R                 15        18.1            41           25
##  3 Ariz… R                  9        18.6            35           28
##  4 Arka… R                 24        22.4            18           26
##  5 Cali… D                 24        12              35           28
##  6 Colo… D                  1        13.6            37           28
##  7 Conn… D                 11        10.8            46           36
##  8 Dela… D                 14        16.2            38           30
##  9 Flor… R                  5        17.9            21           29
## 10 Geor… R                 12        15.6            19           25
## # … with 40 more rows, and 5 more variables: perc_not_distracted <int>,
## #   perc_no_previous <int>, insurance_premiums <dbl>, losses <dbl>,
## #   insurance_rate <chr>
partydriversinsur %>% ggplot(aes(x = pvi_party, y = perc_no_previous, fill = insurance_rate)) +
  geom_bar(stat = "summary", fun = mean, position = "dodge") +
  scale_fill_manual(values = c("blue", "dark green", "purple"),
                    name = "National Insurance Rate",
                    labels = c("High Rate", "Medium Rate", "Low Rate")) +
  xlab("Political Party") + ylab("Percentage of Drivers with No Previous Accidents") +
  ggtitle("Insurance Rate vs Rate of Previous Accidents Per Party")

This barplot answers the question: for the states with a high, medium, or low average insurance rate, what percentage of their drivers have no previous accidents? The bars are grouped by whether a state leans Democratic or Republican. Based on the aggregate percentages of drivers with no previous accidents, one could claim that Republican-leaning states tend to have fewer accidents than Democratic-leaning states. States with medium or low average insurance rates tend to have similar accident histories regardless of political affiliation. In Republican states, a greater share of drivers with no previous accidents pay a high insurance rate; in Democratic states, that share is smaller. There may be a confounding variable behind the disparity between the high-rate-paying Democratic and Republican states.
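The exact bar heights behind this plot can be recovered with a grouped summary (same data as the plot, no new assumptions):

```r
library(dplyr)

# Mean of perc_no_previous per party x insurance-rate group = bar height
partydriversinsur %>%
  group_by(pvi_party, insurance_rate) %>%
  summarize(mean_no_previous = mean(perc_no_previous),  # bar height
            n_states = n())                             # states behind each bar
```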

Clustering

PAM

clust_dat <-partydrivers %>% select(-state, -pvi_party) %>% scale %>% as.data.frame

pam_dat<-partydrivers%>%select(-state,-pvi_party)
sil_width<-vector()
for(i in 2:10){  
  pam_fit <- pam(pam_dat, k = i)  
  sil_width[i] <- pam_fit$silinfo$avg.width  
}
ggplot()+geom_line(aes(x=1:10,y=sil_width))+scale_x_continuous(name="k",breaks=1:10)
## Warning: Removed 1 row(s) containing missing values (geom_path).

# clust_dat is already scaled above, so scaling again here is a no-op
pam1 <- clust_dat %>% scale %>% pam(k=3)

pamclust <- clust_dat %>% mutate(cluster=as.factor(pam1$clustering))
pamclust %>% ggplot(aes(insurance_premiums, losses, num_drivers, color=cluster )) + geom_point()

library(plotly)
pamclust %>%plot_ly(x= ~insurance_premiums, y = ~losses, z = ~num_drivers, color= ~cluster,
type = "scatter3d", mode = "markers") %>%
layout(autosize = F, width = 900, height = 400)
## Warning: Specifying width/height in layout() is now deprecated.
## Please specify in ggplotly() or plot_ly()
## Warning: `arrange_()` is deprecated as of dplyr 0.7.0.
## Please use `arrange()` instead.
## See vignette('programming') for more help
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
library(GGally)
ggpairs(pamclust, columns=1:8, aes(color=cluster))

pamclust %>% group_by(cluster) %>% summarize_if(is.numeric, mean, na.rm=T)
## # A tibble: 3 x 9
##   cluster pvi_amount num_drivers perc_speeding perc_alcohol perc_not_distra…
##   <fct>        <dbl>       <dbl>         <dbl>        <dbl>            <dbl>
## 1 1            0.232      0.0942        0.511         0.578            0.147
## 2 2           -0.496      0.0315       -0.635        -0.557           -0.295
## 3 3            0.401     -0.304        -0.0356       -0.363            0.214
## # … with 3 more variables: perc_no_previous <dbl>, insurance_premiums <dbl>,
## #   losses <dbl>
partydrivers %>% slice(pam1$id.med)
## # A tibble: 3 x 10
##   state pvi_party pvi_amount num_drivers perc_speeding perc_alcohol
##   <chr> <fct>          <dbl>       <dbl>         <int>        <int>
## 1 Miss… R                 19        16.1            43           34
## 2 Geor… R                 12        15.6            19           25
## 3 Verm… D                 24        13.6            30           30
## # … with 4 more variables: perc_not_distracted <int>, perc_no_previous <int>,
## #   insurance_premiums <dbl>, losses <dbl>
pam1$silinfo$avg.width
## [1] 0.0630808
plot(pam1, which=2)

In order to do PAM, I created a dataset called clust_dat by extracting the 8 numeric variables I wanted to analyze from my joined dataset and scaling them, which is important in case the variables are measured on different scales. To choose the number of clusters, I computed the average silhouette width for each k from 2 to 10, viewed the results with ggplot, and picked from the plot; I then ran the pam function on my data with k = 3 and saved the result as pam1. Next, I created pamclust by taking clust_dat and using mutate to add a new variable called cluster, filled from pam1's clustering vector. I put it into ggplot, coloring by cluster, to visualize my final cluster solution. Looking at the clusters, they were not tightly separated from each other and in fact overlapped across the entire scatterplot; they hardly represented clusters at all. I also viewed the solution in plotly, which showed 3 of my variables, and in GGally, which showed all of them; GGally did not show any instances of separated clusters either. After running PAM, I grouped by cluster and summarized to find the (standardized) mean of each variable per cluster. I used the slice function to look at the medoids, the states most representative of each cluster: Missouri, Georgia, and Vermont. Finally, I checked the average silhouette width of pam1 to see how good the solution was and got 0.063; the silhouette plot of pam1 shows the same overall average of about 0.06. A width that low indicates that no substantial structure has been found.
Overall, the clustering structure is essentially uninformative, and these clusters should not be treated as valid groupings of the data.
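The best k by average silhouette width can also be read off the sil_width vector from the loop above, rather than eyeballed from the plot:

```r
# sil_width[1] is NA because the loop starts at k = 2;
# which.max ignores NA and returns the index (= k) of the highest width.
best_k <- which.max(sil_width)
best_k
max(sil_width, na.rm = TRUE)  # widths below ~0.25 suggest no substantial structure
```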